1  Simple Regressions

Based on Wooldridge (2019), Chapters 1 and 2

1.1 Models and Data

What is econometrics?

  • Econometrics = use of statistical methods to analyze economic data

    • Econometric methods are used in many other fields, such as the social sciences, medicine, etc.
  • Econometricians typically analyze nonexperimental data

Typical goals of econometric analysis

  • Estimating relationships between economic variables

  • Testing economic theories and hypotheses

  • Forecasting economic variables

  • Evaluating and implementing government and business policy

Steps in econometric analysis

  1. Economic model (this step is often skipped)

  2. Econometric model


1.1.1 Economic models

  • Micro- or macromodels, growth models, models of open economies, etc.

  • Often use optimizing behavior, equilibrium modeling, …

  • Establish relationships between economic variables

  • Examples: demand equations, pricing equations, Euler equations …

Economic model of crime (Becker (1968))

An equation for criminal activity is derived from utility maximization, which results in

y = f(x_1, x_2, \ldots , x_k)

  • Dependent variable

    • y = Hours spent in criminal activities
  • Explanatory variables x_j

    • “Wage” of criminal activities
    • Wage for legal employment
    • Other income
    • Probability of getting caught
    • Probability of conviction if caught
    • Expected sentence
    • Family background
    • Talent for crime, moral character
  • The functional form of the relationship is not specified

  • The equation above could have been postulated without economic modeling

    • But in this case, the model lacks a theoretical foundation
      • If we have a theoretical model, we can often derive the expected sign of the coefficients or even guess the magnitude
      • This can be compared to the estimated coefficients, and if the expectations are not met, we can search for a rationale

Economic Model of job training and worker productivity

  • What is effect of additional training on worker productivity?

  • Formal economic theory is not really needed to derive this equation, although a formal derivation is clearly possible:

wage = f(educ, exper, \ldots , training)

  • Dependent variable

    • wage = hourly wage
  • Explanatory variables x_j

    • educ = years of formal education
    • exper = years of work force experience
    • training = weeks spent in job training
  • Other factors may be relevant as well, but these are the most important (?)


1.1.2 Econometric models

Econometric model of criminal activity

  • The functional form has to be specified

  • Variables may have to be approximated by other quantities (leading to measurement errors)

crime = \beta_{0} + \beta_{1} { wage } + \beta_{2} { othinc } + \beta_{3} { freqarr } + \beta_{4} { freqconv } + \\ \beta_{5} { avgsen } + \beta_{6} { age } + u

  • crime … measure of criminal activity
  • wage … wage for legal employment
  • othinc … other income
  • freqarr … frequency of prior arrests
  • freqconv … frequency of convictions
  • avgsen … Average sentence length after conviction
  • age … age
  • u … error term, which contains unobserved factors (lack of data), like moral character, wage in criminal activity, family background, etc. Oddly enough, it is this error term that attracts the most attention in econometrics

Econometric model of job training and worker productivity

wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u

  • wage … hourly wage

  • educ … years in formal education

  • exper … years of workforce experience

  • training … weeks spent in job training

  • u … error term representing unobserved determinants of the wage like innate ability, quality of education, family background


  • As mentioned above, most of econometrics deals with the specification of the error u. As we will see, this is essential for a causal interpretation of the estimates

  • Econometric models may also be used for hypothesis testing

    • For example, the parameter \beta_3 represents the effect of training on wages
      • How large is this effect? Is it even different from zero?

1.1.3 Data

  • Econometric analysis requires data and there are different kinds of economic data sets

    • Cross-sectional data

    • Time series data

    • Pooled cross sections

    • Panel/Longitudinal data

  • Econometric methods depend on the nature of the data used

    • Different data sets lead to different estimation problems. Use of inappropriate methods may lead to misleading results
  • Cross-sectional data sets

    • Sample of individuals, households, firms, cities, states, countries or other units of interest at a given point of time/in a given period

    • Cross-sectional observations are more or less independent

    • For example, pure random sampling from a population

    • Sometimes pure random sampling is violated, e.g., units refuse to respond in surveys, or if sampling is characterized by clustering (this usually leads to autocorrelation, heteroscedasticity or sample selection problems)

    • Cross-sectional data are typically encountered in applied microeconomics


# Cross-sectional data set on wages and other characteristics. Look especially at indicator variables
library(wooldridge)
data(wage1) 

head(wage1, 10)
          wage educ exper tenure nonwhite female married numdep smsa northcen south
      1   3.10   11     2      0        0      1       0      2    1        0     0
      2   3.24   12    22      2        0      1       1      3    1        0     0
      3   3.00   11     2      0        0      0       0      2    0        0     0
      4   6.00    8    44     28        0      0       1      0    1        0     0
      5   5.30   12     7      2        0      0       1      1    0        0     0
      6   8.75   16     9      8        0      0       1      0    1        0     0
      7  11.25   18    15      7        0      0       0      0    1        0     0
      8   5.00   12     5      3        0      1       0      0    1        0     0
      9   3.60   12    26      4        0      1       0      2    1        0     0
      10 18.18   17    22     21        0      0       1      0    1        0     0
         west construc ndurman trcommpu trade services profserv profocc clerocc
      1     1        0       0        0     0        0        0       0       0
      2     1        0       0        0     0        1        0       0       0
      3     1        0       0        0     1        0        0       0       0
      4     1        0       0        0     0        0        0       0       1
      5     1        0       0        0     0        0        0       0       0
      6     1        0       0        0     0        0        1       1       0
      7     1        0       0        0     1        0        0       1       0
      8     1        0       0        0     0        0        0       1       0
      9     1        0       0        0     1        0        0       1       0
      10    1        0       0        0     0        0        0       1       0
         servocc    lwage expersq tenursq
      1        0 1.131402       4       0
      2        1 1.175573     484       4
      3        0 1.098612       4       0
      4        0 1.791759    1936     784
      5        0 1.667707      49       4
      6        0 2.169054      81      64
      7        0 2.420368     225      49
      8        0 1.609438      25       9
      9        0 1.280934     676      16
      10       0 2.900322     484     441
# or
library(gt) # for pretty html-table plots

gt(head(wage1,10))
(The gt() call renders the same first ten rows of wage1, shown above, as a formatted HTML table.)

  • Time series data

    • Observations of a variable or several variables over time

    • For example, stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales, …

    • Time series observations are typically serially correlated

    • Ordering of observations conveys important information

    • Data frequency: daily, weekly, monthly, quarterly, annually, high frequency data

    • Typical features of time series: trends and seasonality

    • Typical applications: applied macroeconomics and finance


# Time series data on minimum wages and related variables for Puerto Rico
library(gt) # for pretty html-table plots
library(wooldridge)
data(prminwge)

gt( prminwge[1:20, c("year", "avgmin", "avgcov", "prunemp", "prgnp")] )
year avgmin avgcov prunemp prgnp
1950 0.198 0.201 15.4 878.7
1951 0.209 0.207 16.0 925.0
1952 0.225 0.226 14.8 1015.9
1953 0.311 0.231 14.5 1081.3
1954 0.313 0.224 15.3 1104.4
1955 0.369 0.236 13.2 1138.5
1956 0.447 0.245 13.3 1185.1
1957 0.488 0.244 12.8 1221.8
1958 0.555 0.238 14.2 1258.4
1959 0.588 0.260 13.3 1363.6
1960 0.616 0.270 11.8 1473.2
1961 0.608 0.269 12.7 1562.8
1962 0.707 0.279 12.8 1683.9
1963 0.723 0.279 11.0 1820.7
1964 0.809 0.294 11.2 1916.8
1965 0.834 0.302 11.7 2083.0
1966 0.854 0.444 12.3 2223.2
1967 0.971 0.448 11.6 2328.4
1968 1.104 0.455 10.3 2455.3
1969 1.149 0.455 10.3 2684.0
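
To see the typical trending behavior of such a series, we can plot the average minimum wage over time. This is a minimal sketch reusing the prminwge data loaded above.
# Plotting the average minimum wage over time to illustrate the trend
plot(avgmin ~ year, data = prminwge, type = "l",
     xlab = "Year", ylab = "Average minimum wage")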

  • Pooled cross sections

    • Two or more cross sections are combined in one data set

    • Cross sections are drawn independently of each other

    • Pooled cross sections often used to evaluate policy changes

  • Example:

    • Evaluate effect of change in property taxes on house prices

      • Random sample of house prices for the year 1993

      • A new random sample of house prices for the year 1995

      • Compare before/after (1993: before reform, 1995: after reform)
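
In R, a pooled cross section looks like an ordinary cross-sectional data set with an additional year indicator. The following is an illustrative sketch, assuming the kielmc house price data from the wooldridge package (which pools the years 1978 and 1981, not the hypothetical 1993/1995 tax example above).
# Pooled cross section: house prices sampled independently in two years
library(wooldridge)
data(kielmc)

# number of observations in each of the two cross sections
table(kielmc$year)

# average real house price in each year
tapply(kielmc$rprice, kielmc$year, mean)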


  • Panel or longitudinal data

    • The same cross-sectional units are followed over time. Therefore, wide panels are basically pooled cross sections with the very same units (which are many)

    • Long panels are time series for several units (e.g., countries or counties)

    • Panel data have a cross-sectional and a time series dimension. So we have two id-variables

    • Panel data can be used to account for time-invariant unobservable factors

    • Panel data can also be used to model lagged responses

  • Example:

    • City crime statistics; each city is observed for several years

      • Time-invariant unobserved city characteristics may be modeled

      • Effect of police on crime rates may exhibit time lag


# Panel data set on city crime statistics
library(wooldridge)
data(countymurders)

gt( countymurders[ (countymurders$year >= 1990 & countymurders$countyid <= 1005), 
                      c("countyid", "year", "murders", "popul", "percblack", 
                        "percmale", "rpcpersinc")] )
countyid year murders popul percblack percmale rpcpersinc
1001 1990 1 34512 20.19000 40.46000 10975.24
1001 1991 1 35024 20.27000 40.48000 11152.39
1001 1992 1 35560 20.34000 40.51000 11263.97
1001 1993 1 37027 20.48505 48.68339 11312.82
1001 1994 1 38027 20.64849 48.71013 11541.15
1001 1995 5 38957 20.87686 48.72552 11680.74
1001 1996 7 40061 20.97551 48.70073 11852.76
1003 1990 7 99200 13.01000 41.30000 11600.30
1003 1991 3 102224 13.04000 41.37000 11854.09
1003 1992 5 105344 13.07000 41.43000 12124.56
1003 1993 7 111018 13.17624 48.69210 12645.61
1003 1994 5 115266 13.28579 48.73163 13012.65
1003 1995 13 119373 13.42347 48.82176 13327.95
1003 1996 6 123023 13.49666 48.83233 13583.02
1005 1990 4 25532 44.22000 39.38000 9997.83
1005 1991 4 25728 44.44000 39.43000 10371.41
1005 1992 0 25932 44.67000 39.44000 11039.38
1005 1993 3 26461 45.28930 48.74721 10721.85
1005 1994 3 26445 45.70240 49.01115 10912.72
1005 1995 3 26337 46.00372 49.12860 10702.64
1005 1996 1 26475 46.19075 49.15203 10760.51
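
The two id-variables mentioned above (here countyid and year) define the panel structure. Below is a minimal sketch of how to inspect this structure, reusing the countymurders data loaded above; the murder rate per 10,000 inhabitants is an illustrative derived variable of my own choosing.
# Inspecting the panel structure: one row per county-year combination
length(unique(countymurders$countyid))  # number of cross-sectional units
length(unique(countymurders$year))      # number of time periods

# murders per 10,000 inhabitants (illustrative derived variable)
countymurders$murdrate10k <- with(countymurders, 10000 * murders / popul)
head(countymurders[, c("countyid", "year", "murders", "popul", "murdrate10k")])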

1.2 Causality

Definition of causal effect of x on y: \ \ x \rightarrow y

  • How does variable y change if variable x is changed while all other relevant factors are held constant?

    • Most economic questions are ceteris paribus questions

    • It is useful to describe how an experiment would have to be designed to infer the causal effect in question (see examples below)

Simply establishing a relationship – correlation – between variables is not sufficient. Correlation alone says nothing about causality!

  • The question is, whether a found effect (correlation) between x and y can be considered as causal. There are several possibilities:

    • x \rightarrow y

    • x \leftarrow y

    • x \leftrightarrows y

    • z_j \rightarrow x \text{ and } z_j \rightarrow y, \ \ldots

  • If we have controlled for enough other variables z_j, then the estimated ceteris paribus effect can often be considered to be causal (but not always, as not all variables are observable) 1

  • However, it is typically difficult to establish causality and we always need some identifying assumptions, which should be credible


1.2.1 Some Examples

“Post hoc, ergo propter hoc” fallacy 2

Figure 1.1: Does carrying an umbrella in the morning cause rainfall in the afternoon? Which of the above cases is this?

Further examples

Causal effect of fertilizer on crop yield

  • “By how much will the production of soybeans increase if one increases the amount of fertilizer applied to the ground”
  • Implicit assumption: all other factors z_j that influence crop yield such as quality of land, rainfall, presence of parasites etc. are held fixed

Experiment:

  • Choose several one-acre plots of land; randomly assign different amounts of fertilizer to the different plots; compare yields
  • Experiment works because amount of fertilizer applied is unrelated to other factors (including the original crop yield y) influencing crop yields

Measuring the return to education

  • “If a person is chosen from the population and given another year of education, by how much will his or her wage increase?”
  • Implicit assumption: all other factors z_j that influence wages such as experience, family background, intelligence etc. are held fixed

Experiment:

  • Choose a group of people; randomly assign different amounts of education to them (infeasible!); compare wage outcomes
  • Problem without random assignment: amount of education is related to other factors that influence wages (e.g., intelligence or diligence);
    this is a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem

Effect of law enforcement on city crime level

  • “If a city is randomly chosen and given ten additional police officers, by how much would its crime rate fall?”
  • Alternatively: “If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?”

Experiment:

  • Randomly assign number of police officers to a large number of cities
  • In reality, number of police officers will be determined by crime rate – simultaneous determination of crime and number of police;
    this is mainly a x \leftrightarrows y – problem

Effect of the minimum wage on unemployment

  • “By how much (if at all) will unemployment increase if the minimum wage is increased by a certain amount (holding other things fixed)?”

Experiment:

  • Government randomly chooses minimum wage each year and observes unemployment outcomes. The experiment will work because level of minimum wage is unrelated to other factors determining unemployment
  • In reality, the level of the minimum wage will depend on political and economic factors that also influence unemployment;
    mainly a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem

1.3 The Simple Regression Model

Definition of the simple linear regression model:

y = \beta_0 + \beta_1 x + u \tag{1.1}

  • Thereby

    • \ y … Dependent variable, explained variable, response variable or regressand
    • \ x … Independent variable, explanatory variable or regressor
    • \ \beta_0 … Intercept
    • \ \beta_1 … Slope parameter
    • \ u … Error term, disturbance, unobserved factors with E(u)=0, which is not restrictive because of \beta_0

This is a simple regression model, because we have only one explanatory variable.

  • Equation 1.1 describes what change in y we can expect if x changes. It follows:

    \dfrac {dE(y|x)}{dx} \ = \ \beta_1 + \dfrac {dE(u|x)}{dx} \ = \ \beta_1

    as long as \dfrac {dE(u|x)}{dx} = 0

  • Interpretation of \beta_1: By how much does the dependent variable change (on average, as u always varies in some way) if the independent variable is increased by one unit?

    • This interpretation is only correct if all other things (contained in u) remain (on average) constant when the independent variable x is increased by one unit!

Remark: The simple linear regression model is rarely applicable in practice but its discussion is useful for pedagogical reasons

  • Using a simple regression model we usually have a \ (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem rendering the causal interpretation of \beta_1 incorrect in most cases

1.3.1 Some Examples


  • A simple wage equation: wage = \beta_0 + \beta_1 educ + u
    • \beta_1 measures the change in hourly wage given another year of education, holding all other factors fixed
    • u represents labor force experience, tenure with current employer, work ethic, intelligence, etc.
  • Soybean yield and fertilizer: yield = \beta_0 + \beta_1 fertilizer + u
    • \beta_1 measures the effect of fertilizer on yield, holding all other factors fixed
    • u represents unobserved (or omitted) factors like Rainfall, land quality, presence of parasites, etc.

1.3.2 Conditional mean independence assumption

When is a causal interpretation of Equation 1.1 justified?

  • Conditional mean independence assumption

E(u \, | \, x) = E(u) = 0 \tag{1.2}

  • The explanatory variable must not contain any information about the mean of the unobserved factors in u

    • So knowing something about x doesn't give us information about u

    • This leads to \frac {dE(u \mid x)}{dx}=0 as required. If this assumption is satisfied, we actually have a (x \rightarrow y) – case

  • Regarding the wage example wage = \beta_0 + \beta_1 educ + u, ability is likely an important but often unobserved determinant of the wage of a particular individual. As ability is not an explicit variable in the model, it is contained within u

    • The conditional mean independence assumption is unlikely to hold in this case because individuals with more education will also be more capable on average. Knowing something about the education (variable x) of a particular individual therefore gives us some information about the ability of that individual (which is in u)

      • Hence, E(u \, | \, x) \neq 0 is easily possible in this case

      • Basically, we have the (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being ability

  • Regarding the fertilizer example, a similar argument holds. Typically, a farmer uses more fertilizer if the quality of the soil is bad. Therefore, the quality of the soil, which is part of u, influences both crop yield and the amount of fertilizer used. Hence, we once again have a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being the quality of the soil

    • And furthermore, E(u \, | \, x) \neq 0, as the amount of fertilizer used (variable x) gives us information about the quality of the soil, which is part of u \; \Rightarrow \; the conditional mean independence assumption is probably violated in this case (see the simulation sketch below)
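
The following small simulation illustrates why such a violation matters. It is only a sketch with made-up numbers: an unobserved factor z (think of ability or soil quality) influences both x and y, so the simple regression of y on x does not recover the true causal coefficient of 0.5.
# Simulation of the (z_j -> x and z_j -> y) problem
set.seed(123)
n <- 10000
z <- rnorm(n)               # unobserved factor (e.g., ability or soil quality)
x <- 0.8*z + rnorm(n)       # x depends on z
u <- z + rnorm(n)           # z is also part of the error term u
y <- 2 + 0.5*x + u          # true causal effect of x on y is 0.5

coef(lm(y ~ x))             # slope estimate is biased upwards, far from 0.5
cor(x, u)                   # x and u are correlated, so E(u|x) != 0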

1.3.3 Population regression function (PRF)

Taking the conditional expectation of Equation 1.1, we arrive at the so-called population (true) regression function

E(y \, | \, x) \ = \ E(\beta_0 + \beta_1 x + u \, | \, x) \ = \ \beta_0 + \beta_1 x + \underbrace {E(u \, | \, x)}_{= \, 0} \tag{1.3}

Because of Equation 1.2, this implies

E(y \, | \, x) \ = \ \beta_0 + \beta_1 x \tag{1.4}

  • This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable and Equation 1.4 is, in a certain sense, the best possible predictor of y, given the information x and assumption Equation 1.2

  • Furthermore, \beta_1 = \dfrac {dE(y|x)}{dx}. That means that a one-unit increase in x changes the conditional expected value (the average) of y by the amount of \beta_1 (if the conditional mean independence assumption is met)

  • For a given value of x, the distribution of y is centered around E(y|x), as illustrated in Figure 1.2, which shows a graphical representation of the population regression function

Figure 1.2: Population regression line; Source: Wooldridge (2019)

1.3.4 Estimation

  • In order to estimate the regression model one needs data, i.e., a random sample of n observations (y_i, x_i), \ i=1, \ldots , n

  • The task is to fit a regression line through the data points as well as possible; this fitted line is an estimate of the PRF:

\hat y_i = \hat \beta_0 + \hat \beta_1 x_i \tag{1.5}

  • The following Figure 1.3 gives an illustration of this problem

Figure 1.3: Estimated regression line; Source: Wooldridge (2019)

Principle of ordinary least squares – OLS

What does “as good as possible” mean?

  • We define the regression residuals \hat u_i as (note, a hat, “^”, always denotes an estimated value)

\hat u_i \ \equiv \ y_i - \hat y_i \ = \ y_i - \underbrace {(\hat \beta_0 + \hat \beta_1 x_i)}_{\hat y_i} \tag{1.6}

  • We choose \hat \beta_0 and \hat \beta_1 so as to minimize the sum of squared regression residuals

\underset {\hat \beta_0, \hat \beta_1} {\operatorname {min}} \ \sum_{i=1}^n \hat u_i^2 \ \ \rightarrow \ \ \hat \beta_0, \, \hat \beta_1 \tag{1.7}

  • The resulting first order conditions are

\dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =

\quad \quad \quad \sum_{i=1}^n -2 (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset {!}{=} 0

\dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =

\quad \quad \quad \quad \sum_{i=1}^n - 2x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset {!}{=} 0


From these first order conditions we immediately arrive at the so-called normal equations, which are two linear equations in the two unknowns \hat \beta_0 and \hat \beta_1

\sum_{i=1}^n (\underbrace {y_i - \hat \beta_0 - \hat \beta_1 x_i}_{\hat u_i})= 0 \tag{1.8}

\sum_{i=1}^n x_i ( {y_i - \hat \beta_0 - \hat \beta_1 x_i}) = 0 \tag{1.9}

Dividing the first normal equation, Equation 1.8, by n gives

\frac {1}{n} \sum_{i=1}^n y_i - \hat \beta_0 - \hat \beta_1 \frac {1}{n}\sum_{i=1}^n x_i = 0

  • This implies

\bar y = \hat \beta_0 + \hat \beta_1 \bar x \ \ \Rightarrow \ \ \hat \beta_0 = \bar y - \hat \beta_1 \bar x \tag{1.10}


To calculate the slope parameter \hat \beta_1, we insert Equation 1.10 into the second normal equation, Equation 1.9

\sum_{i=1}^n x_i (y_i - \underbrace {(\bar y - \hat \beta_1 \bar x)}_{\hat \beta_0} - \hat \beta_1 x_i) = 0

  • Dividing by n and expanding the sum leads to

\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \frac {1}{n} \sum_{i=1}^n x_i + \hat \beta_1 \bar x \frac {1}{n} \sum_{i=1}^n x_i - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0 \ \ \Rightarrow

\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \bar x + \hat \beta_1 \bar x^2 - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0

  • Collecting terms and applying the shift theorem for variances (“Steinerscher Verschiebungssatz”) we get

\frac {1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y) - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = 0

  • This immediately leads to the OLS formula for the slope parameter

\hat \beta_1 \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \tag{1.11}

This equals the sample covariance of y and x divided by the sample variance of x

Formula Equation 1.11 is only defined if there is some variation in the explanatory variable x, i.e., the sample variance of x must not be zero

After having calculated \hat \beta_1 by the formula in Equation 1.11 we get \hat \beta_0 by inserting \hat \beta_1 into formula Equation 1.10
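
These formulas are easy to apply directly in R. The following is a small sketch using the wage1 data from Section 1.1.3 and a regression of wage on educ (my own choice of variables for illustration). The normalizing factors in the numerator and denominator of Equation 1.11 cancel, so the built-in sample covariance and variance functions can be used as they are.
# OLS "by hand": slope from Equation 1.11, intercept from Equation 1.10
library(wooldridge)
data(wage1)

b1 <- with(wage1, cov(educ, wage) / var(educ))   # Equation 1.11
b0 <- with(wage1, mean(wage) - b1 * mean(educ))  # Equation 1.10
c(b0 = b0, b1 = b1)

# comparison with the built-in OLS routine
coef(lm(wage ~ educ, data = wage1))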


Algebraic properties of OLS

The first normal equation, Equation 1.8, implies:

  1. The regression line always passes through the sample midpoint (\bar x, \bar y), according to Equation 1.10

  2. The sum (and average) of the residuals is zero: \sum_{i=1}^n \hat u_i = 0 according to Equation 1.8 and the definition in Equation 1.6

Furthermore, the second normal equation, Equation 1.9, together with the definition of the residuals, Equation 1.6, implies:

  1. The regressor x_i and the regression residuals \hat u_i are orthogonal:
    \sum_{i=1}^n x_i \hat u_i=0, i.e., they are uncorrelated

This is the extremely important orthogonality property of OLS


Estimation by Methods of Moments

Another approach for estimating the (true) population parameters \beta_0 and \beta_1 is the method of moments procedure, MoM

  • The basis for this is the conditional mean independence assumption, Equation 1.2, E(u \, | \, x) = E(u) = 0. This implies that the covariance between u and x is zero:

\operatorname {Cov}(x,u) \ = \ E \left[ (x-E(x)) \, (u-0) \right] \ =

E(x \, u) - E(x) \underbrace {E(u)}_0 \ = \ E(x \, u) \quad \Rightarrow

E(x \, u) = E_x [x \underbrace {E(u | x)}_0 ] = 0

  • Hence, we have two (population) moment restrictions

E(u) \ = \ E(\underbrace {y-\beta_0-\beta_1 x}_u) = 0 \tag{1.12}

E(x \, u) \ = \ E[x \, (y-\beta_0-\beta_1 x)]=0 \tag{1.13}


The method of moments approach to estimate the parameters imposes these two population moments restrictions on the sample data

  • In particular: the population moments are replaced by their sample counterparts

  • The justification is as follows: By the Law of Large Numbers, LLN, the sample moments converge to their population/theoretical counterparts under rather weak assumptions (stationarity, weak dependence). E.g., with increasing sample size n, the sample mean of a random variable converges to the expectation of this random variable (compare Theorem A.2)

  • So we can estimate the population moments by the corresponding empirical moments. In particular, we estimate the expectation, E(y), with the arithmetic sample mean \bar y, knowing that by the LLN this sample estimator converges to E(y) with increasing sample size

  • Hence, the population moment conditions, Equation 1.12 and Equation 1.13, can be replaced (estimated) by their corresponding sample means:

\frac {1}{n} \sum_{i=1}^n (y_i-\hat \beta_0-\hat \beta_1 x_i)=0

\frac {1}{n} \sum_{i=1}^n x_i \, (y_i-\hat \beta_0-\hat \beta_1 x_i)=0


However, the above conditions (which the parameters \beta_0 and \beta_1 have to meet) are exactly the same as the first order conditions from minimizing the sum of squared residuals, the normal equations, Equation 1.8 and Equation 1.9, and therefore yield the same solutions.

  • Hence, OLS and MoM estimation yield the very same estimated parameters \hat \beta_0 and \hat \beta_1 in this case. (For an additional analysis of MoM estimation, see Section 2.4.1)

  • Furthermore, the OLS estimator is also equal to the maximum likelihood estimator, ML, assuming normally distributed error terms

    • Maximum likelihood estimation is treated in more detail in Section 10.2. Intuitively, ML means that – for a given sample – the estimated parameters are chosen such that the probability of obtaining the respective sample is maximized
  • Under standard assumptions, OLS, MoM and ML estimators are equivalent (but generally, they can be different!)
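
As a small sketch of the MoM idea (again using the wage–education example, my own choice for illustration), the two sample moment conditions form a system of two linear equations in \hat \beta_0 and \hat \beta_1 that can be solved directly and indeed reproduces the OLS estimates.
# Method of moments: solving the two sample moment conditions directly
library(wooldridge)
data(wage1)

x <- wage1$educ
y <- wage1$wage

# the conditions  mean(y - b0 - b1*x) = 0  and  mean(x*(y - b0 - b1*x)) = 0
# written as the linear system A %*% c(b0, b1) = b
A <- rbind(c(1,       mean(x)),
           c(mean(x), mean(x^2)))
b <- c(mean(y), mean(x * y))
solve(A, b)                          # MoM estimates of (beta0, beta1)

coef(lm(wage ~ educ, data = wage1))  # identical to the OLS estimates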


1.3.5 An example in R

  • Install R from https://www.r-project.org

  • Install RStudio from https://rstudio.com/products/rstudio/download/#download

  • Start RStudio and install the packages AER and wooldridge (which we will need very often). For that purpose go to the lower right window, choose the tab Packages, then the tab Install and enter AER and then click Install. If you are asked during the installation whether you want to compile code, type: no (in the lower left window). Repeat the same for the package wooldridge

  • To input code, use the upper left window. To execute code, mark the code in the upper left window and click on the Run button at the top of the upper left window

  • You will see the results in the lower left window

  • To run the examples from these slides, simply copy the code from the slides (shaded in grey) into the upper left window, mark it and run it


We want to investigate to what extent the success in an election is determined by the expenditures during the campaign.
# We use a data set contained in the "wooldridge" package 

# We already installed this package, however, if we want to use it in R,  
# we additionally have to load it with the library() command
library(wooldridge)

# Loading the data set "vote1" from the wooldridge package with the "data" command
data(vote1)

# printing out the first 6 observations of the data set "vote1" with the command "head()"
head(vote1)
        state district democA voteA expendA expendB prtystrA lexpendA lexpendB
      1    AL        7      1    68 328.296   8.737       41 5.793916 2.167567
      2    AK        1      0    62 626.377 402.477       60 6.439952 5.997638
      3    AZ        2      1    73  99.607   3.065       55 4.601233 1.120048
      4    AZ        3      0    69 319.690  26.281       64 5.767352 3.268846
      5    AR        3      0    75 159.221  60.054       66 5.070293 4.095244
      6    AR        4      1    69 570.155  21.393       46 6.345908 3.063064
          shareA
      1 97.40767
      2 60.88104
      3 97.01476
      4 92.40370
      5 72.61247
      6 96.38355

Plotting the percentage of votes for candidate A versus the share of campaign expenditures from A.
plot(voteA ~ shareA, data=vote1)


Running a regression of voteA on shareA with the command lm() (for linear model)
out <- lm(voteA ~ shareA, data=vote1)
# We stored the results in a list with the freely chosen name "out"

# With coef(out) we print out the estimated coefficients 
# Try to interpret the estimated coefficients 
coef(out)
      (Intercept)      shareA 
       26.8122141   0.4638269
# With fitted(out) we store the fitted values
yhat <- fitted(out)

# With residuals(out) we store the residuals
uhat <- residuals(out)
Checking the orthogonality property of OLS – the correlation between the explanatory variable x and the residuals \hat u.
round( cor(uhat, vote1$shareA), digits = 14) 
      [1] 0
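
We can also check the other algebraic properties of OLS from Section 1.3.4 numerically; they hold up to floating point error (a small sketch reusing out and uhat from above).
# Sum (and average) of the residuals is numerically zero
round( sum(uhat), digits = 8)

# Regressor and residuals are orthogonal
round( sum(vote1$shareA * uhat), digits = 8)

# Regression line passes through the sample midpoint (x-bar, y-bar)
c( mean(vote1$voteA), coef(out)[1] + coef(out)[2] * mean(vote1$shareA) )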

# Previous plot plus estimated regression line.
plot(voteA ~ shareA, data=vote1)
abline(out)


Plotting residuals. These should show no systematic pattern.
plot(uhat)
abline(0,0)


Plotting predicted values versus actual values of voteA. Are predictions biased?
plot(yhat ~ voteA, data=vote1)

# 45° line
abline(0,1)


Plotting squared residuals versus fitted values. Useful for detecting a varying variance (heteroscedasticity)
plot(uhat^2 ~ yhat, data=vote1)


Discussion of example

This simple model for the success in an election seems very plausible; however, it suffers from a very common problem

  • In this particular example, the conditional mean independence assumption is almost certainly violated. Why?

    • Because the campaign expenditures strongly depend on donations from supporters. The stronger a candidate is in a particular district, the more donations he or she will receive and the higher the potential campaign expenditures will be

    • Hence, we have a reverse causality problem here, \ x \leftrightarrows y, or a third-variable problem z_j \rightarrow x \text{ and } z_j \rightarrow y, both of which lead to E(u|x) \neq 0 in general

    • This will probably lead to a strong overestimation of the effect of campaign expenditures on votes in this particular case

  • Note that although x is very likely correlated with unobserved factors in u, the example above showed that the correlation between x and the sample residuals \hat u is zero – orthogonality property of OLS. Hence, this fact says nothing about whether the conditional mean independence assumption is satisfied or not

  • A possible remedy: Multiple regression model (with variables z as additional variables included in the set of explanatory variables) or trying to identify the x \rightarrow y relationship with external information (like instrumental variables; we will deal with this approach in Chapter 7)


1.3.6 Measures of Goodness-of-Fit

How well does the explanatory variable explain the dependent variable?

Measures of Variation

SST = \sum\nolimits_{i=1}^n (y_i - \bar y)^2, \quad SSE = \sum\nolimits_{i=1}^n (\hat y_i - \bar y)^2, \quad SSR =\sum\nolimits_{i=1}^n \hat u_i^2

  • SST is total sum of squares, represents total variation in the dependent variable

  • SSE is explained sum of squares, represents variation explained by regression

  • SSR is residual sum of squares, represents variation not explained by regression

Decomposition of total variation (because of y_i = \hat y_i + \hat u_i, \sum_i x_i \hat u_i=0 and \sum_i \hat u_i=0)

SST = SSE + SSR \tag{1.14}

Goodness-of-Fit measure

R^2 \ \equiv \ \dfrac {SSE}{SST}\ = \ 1 - \dfrac {SSR}{SST} \tag{1.15}

The R-squared measures the fraction of the total variation in y that is explained by the regression


Example

# Running once more the regression of voteA on shareA with the command lm() 
out <- lm(voteA ~ shareA, data=vote1)

# Printing a summary of the regression
summary(out)
      
      Call:
      lm(formula = voteA ~ shareA, data = vote1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -16.8919  -4.0660  -0.1682   3.4965  29.9772 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
      (Intercept) 26.81221    0.88721   30.22   <2e-16 ***
      shareA       0.46383    0.01454   31.90   <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 6.385 on 171 degrees of freedom
      Multiple R-squared:  0.8561,  Adjusted R-squared:  0.8553 
      F-statistic:  1018 on 1 and 171 DF,  p-value: < 2.2e-16
# Caution: A high R-squared does not mean that the regression has a causal interpretation!
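
The sums of squares behind this R-squared can also be computed by hand (a small sketch, reusing yhat and uhat from Section 1.3.5).
# Decomposition of total variation and R-squared by hand
SST <- sum( (vote1$voteA - mean(vote1$voteA))^2 )
SSE <- sum( (yhat - mean(vote1$voteA))^2 )
SSR <- sum( uhat^2 )

c(SST = SST, SSE_plus_SSR = SSE + SSR)  # decomposition, Equation 1.14
c(R2 = SSE/SST, R2_alt = 1 - SSR/SST)   # Equation 1.15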

1.3.7 Statistical Properties of OLS

  • The OLS parameter estimates (estimated coefficients) are functions of random variables and thus random variables themselves

  • We are interested in the moments and the distribution of the estimated coefficients, especially in the expectations and variances

  • Three questions are of particular interest:

    • Are the OLS estimates unbiased, i.e., E(\hat \beta_i) = \beta_i \, ?

    • How precise are our parameter estimates, i.e., how large is their variance \operatorname {Var}(\hat \beta_i) \; ?

    • How are the estimated OLS coefficients distributed?


Unbiasedness of OLS

Theorem 1.1 (Unbiasedness of OLS) Given a random sample and conditional mean independence of u_i from x we state:

E(\hat \beta_0)=\beta_0, \ \ E(\hat \beta_1)=\beta_1

From Equation 1.11 we have

\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ \tag{1.16}

We substitute for y_i = \beta_0 + \beta_1 x_i + u_i

\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (\beta_0 + \beta_1 x_i + u_i)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ =

\beta_0 \underbrace{ \left[ \dfrac { \frac{1}{n} {\sum_{i=1}^n (x_i - \bar x)} }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_0 + \beta_1 \underbrace{ \left[ \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) x_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ \ \Rightarrow

\hat \beta_1 \, = \, \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i } { \underbrace{ \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }_{s_x^2} } \tag{1.17}

Taking the conditional expectation, considering the conditional mean independence assumption

E(\hat \beta_1 | x_1, \ldots, x_n ) \ = \ \beta_1 + \dfrac {1}{s_x^2} \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) \underbrace {E [ u_i | x_1, \ldots, x_n ]}_0 \ = \ \beta_1 \ \ \text{ and }

E(\hat \beta_1) = E_x [E(\hat \beta_1 | x_1, \ldots, x_n )] = E_x(\beta_1)=\beta_1

by the law of iterated expectations

Interpretation of unbiasedness

  • The estimated coefficients may be smaller or larger than the true values, depending on the sample which is the result of a random draw

  • However, on average, they will be equal to the true value (on average means with regard to repeated samples)

  • In a given sample, estimates may differ considerably from true values


Variances of the OLS estimates

  • Depending on the sample, the estimates will be nearer or farther away from the true values

  • How far can we expect our estimates to be away from the true population values on average? (=sampling variability or sampling errors)

  • Sampling variability is measured by the estimators’ variances

  • We need an additional assumption to easily calculate these variances:

Homoscedasticity of u_i

\operatorname{Var}(u_i| x_1, \ldots, x_n) = \sigma^2 \tag{1.18}

  • The values of the explanatory variable must not contain any information about the variability of the unobserved factors

  • Together with the conditional mean independence assumption this furthermore implies that the conditional variance of u is also equal to the unconditional variance of u

\operatorname{Var}(u) = E_x [ \underbrace{ E ( u^{2} | x)-[ E(u | x)]^{2} }_{ \operatorname{Var}(u_i| x_1, \ldots, x_n) = \sigma^2 } ] = E_x [ E ( u^{2} | x ) ] = E_x(\sigma^2) = \sigma^2

  • The square root of \sigma^2 is \sigma, the standard deviation of the error

  • Example: y = f(x) + u

Figure 1.4: Homoscedastic errors; Source: Wooldridge 2020

  • Example: wage = f(education) + u

Figure 1.5: Heteroscedastic errors; Source: Wooldridge 2020

Theorem 1.2 (Variance of OLS estimators) Under random sampling, conditional mean independence of u_i from x and homoscedasticity we have

\operatorname{Var}(\widehat{\beta}_{1} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}=\dfrac{\sigma^{2}}{S S T_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} \tag{1.19}

\operatorname{Var} (\widehat{\beta}_{0} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2} \frac {1}{n} \sum_{i=1}^{n} x_{i}^{2}}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}} = \dfrac{\sigma^{2} \, \bar {x^{2}}} {SST_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} \, \bar {x^{2}} \tag{1.20}

From the proof of Theorem 1.1, we use Equation 1.17

\hat \beta_1 \ = \ \ \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{ {s_x^2} }

Hence, according to Equation 1.18 and random sampling we have

\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n) \ = \ \dfrac { \frac{1}{n^2} \sum_{i=1}^n (x_i - \bar x)^2 \operatorname {Var} ( u_i |x_1, \ldots, x_n )}{ { (s_x^2)^2 } } \ = \ \dfrac {\sigma^2}{SST_x} \ = \ \frac {1}{n} \dfrac {\sigma^2}{s_x^2}

For the unconditional variance (which is rarely used) we have

\operatorname{Var}(\hat \beta_1) \ \equiv \ E \left[ \left( \hat \beta_1-E(\hat \beta_1) \right)^2 \right] \ = \ E_x \! \left[ \underbrace { E \left( (\hat \beta_1 - \beta_1)^2 \, | \, x_1, \ldots, x_n \right) }_{\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n)} \right] \\ = \ E_x \! \left[ \dfrac {\sigma^2}{n \, s_x^2} \right] \ = \ \dfrac {1}{n} E_x \! \left[ \dfrac {\sigma^2}{s_x^2} \right]

The sampling variability of the estimated regression coefficients will be lower,

  • the smaller the variability of the unobserved factors \sigma^2

  • the higher the variation in the explanatory variable s_x^2

    • In particular, the ratio of \sigma / s_x is crucial
  • the larger the sample size n


Estimating the variance of error term

  • According to our homoscedasticity assumption the variance of the error term u is independent of the explanatory variables

\operatorname {Var}(u \, | \, x) = \sigma^2 = \operatorname {Var}(u)

  • However, \sigma^2 is usually unknown, so we need an estimator for this parameter

  • A natural procedure is to use the variance of the sample residuals (note, \bar {\hat u}_i = 0, which is an OLS property, see Section 1.3.4.2, #2)

\hat \sigma^2 = \dfrac {1}{n-2} \sum_{i=1}^n (\hat u_i - \bar {\hat u}_i)^2 \ = \ \dfrac {1}{n-2} \sum_{i=1}^n \hat u_i^2 \tag{1.21}

\text{and} \quad S.E. \ \equiv \ \hat \sigma \ = \ \sqrt{\hat \sigma^2} \tag{1.22}

  • This estimator turns out to be unbiased under our assumptions (see Theorem 2.1)

  • Note that we divide by (n-2) and not by n to calculate the average above. The reason is that to calculate the \hat u_i, we first need to estimate the two parameters \beta_0 and \beta_1. This means that, knowing these two estimated parameters, only (n-2) of the residuals are informative – from the two estimated parameters together with (n-2) residuals we could infer the remaining two via the normal equations. Therefore, these last two carry no additional information

    • The number (n-2), which is the number of observations minus the number of estimated model parameters, is referred to as degrees of freedom

Standard errors for regression coefficients

Having an estimate for \sigma^2 and the standard error S.E., we are able to estimate the standard errors of the parameter estimates

Calculation of standard errors for regression coefficients

Using formulas Equation 1.19 and Equation 1.22 we arrive at

se(\hat \beta_1) \ = \ \sqrt{\widehat {\operatorname {Var}}(\hat \beta_1 | x_1, \ldots , x_n)} \ = \ \sqrt{\dfrac {\hat \sigma^2}{SST_x} } \tag{1.23}

  • The estimated standard deviations of the regression coefficients are called standard errors. They measure how precisely the regression coefficients are estimated
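
For the voting example, these quantities can be computed by hand (a sketch reusing out and uhat from Section 1.3.5) and compared with the "Residual standard error" and the standard error of shareA in the summary output of Section 1.3.6.
# Estimating sigma^2, the S.E. of the regression and se(beta1-hat) by hand
n      <- length(uhat)
sigma2 <- sum(uhat^2) / (n - 2)   # Equation 1.21
sqrt(sigma2)                      # Equation 1.22, compare "Residual standard error"

SSTx <- sum( (vote1$shareA - mean(vote1$shareA))^2 )
sqrt(sigma2 / SSTx)               # Equation 1.23, compare the standard error of shareA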

The following figures illustrate the theoretical concepts discussed above.
## Monte Carlo simulation for regressions with one explanatory variable 

##################### definition of function #########################################
sims <- function(n=120, rep=5000, sigx=1, sig=1) {
  
  set.seed(13468) # seed for random number generator
  
  # true parameters
  B0 = 0
  B1 = 0.5

  OLS  <- vector(mode = "list", length = rep)  # initializing list for storing regression results
  OLS1 <- vector(mode = "list", length = rep)  # initializing list for results with the smaller error variance
  
  ######################### rep loop #################################################
  for (i in (1:rep)) {
    x  =  rnorm(n, mean = 0, sd = sigx)
    u  =  rnorm(n, mean = 0, sd = sig)   
    u1 =  u/2         

    maxx = max(x)
    minx = min(x)
      
    y  = B0 + B1*x + u
    y1 = B0 + B1*x + u1
    
    maxy = max(y)
    miny  = min(y)
      
    OLS[[i]]  =  lm(y ~ x, model = FALSE)
    OLS1[[i]] =  lm(y1 ~ x, model = FALSE)
  }
  ########################## end rep loop ############################################
  
  
  ######################### drawing plots ############################################
  # scatterplot with true and estimated reg-line for last regression
  plot(y ~ x, col="blue")
  abline(OLS[[i]], col="blue")
  abline(c(B0,B1), col="red")
  
  # rep > 100: histogram of estimated parameter b1
  if (rep > 100) {
    b1_distribution <- sapply(OLS, function(x) coef(x)[2])
    hist(b1_distribution, breaks = 30, main="") 
    abline(v=B1, col = "red")
  }

  # true and up to 100 estimated reg-lines 
  plot(NULL, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1), ylab="y", xlab="x")
  for ( i in 1:min(100, rep) ) abline(OLS[[i]], col="lightgrey")
  points(y ~ x, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1))
  abline(c(B0,B1), col="red")
  
  # true and up to 100 estimated reg-lines, smaller sig
  plot(NULL, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1), ylab="y1", xlab="x")
  for ( i in 1:min(100, rep) ) abline(OLS1[[i]], col="lightgrey")
  points(y1 ~ x, col="blue", xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1))
  abline(c(B0,B1), col="red") 
  
}
######################### end of function ############################################


## Calling function `sims()` with default values for parameters
sims()

(a) Population regression function (red) and estimated regression function of a particular sample (blue), 120 observations, compare Figure 1.2 and Figure 1.3

(b) Unbiasedness: Histogram of 5000 estimates of \beta_1 based on random draws of u and x with 120 observations each. True value of \beta_1 is 0.5

(c) Variance of \hat \beta_1: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey

(d) Variance of \hat \beta_1 with smaller {\sigma} / {\sigma_x}: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey. Variance of \hat \beta_1 is much smaller

Figure 1.6: Population regression function (PRF), estimated regression functions, unbiasedness and variance of estimates


1.3.8 Example once more

We repeat the regression output from our voting example. Look for the new concepts we just discussed in the regression output shown below.

library(modelsummary)

# Running once more the regression of voteA on shareA with the command lm() 
out <- lm(voteA ~ shareA, data=vote1)

modelsummary(list("Vote for candidate A"=out), 
             shape =  term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Vote for candidate A
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   26.812***  0.887  30.221  <0.001  25.061  28.564
shareA    0.464***  0.015  31.901  <0.001   0.435   0.493
Num.Obs.  173      
R2    0.856   
RMSE    6.35    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

  1. If every variable z_j that influences both x and y is known and observable, the x \leftrightarrows y case reduces to a z_j \rightarrow x \text{ and } z_j \rightarrow y – problem.↩︎

  2. Latin: “after this, therefore because of this.”↩︎